Deconvolution of bulk tumors into distinct immune cell states predicts colorectal cancer recurrence

Summary Predicting colorectal cancer recurrence after tumor resection is crucial because it promotes the administration of proper subsequent treatment or management to improve the clinical outcomes of patients. Several clinical or molecular factors, including tumor stage, metastasis, and microsatellite instability status, have been used to assess the risk of recurrence, although their predictive ability is limited. Here, we predicted colorectal cancer recurrence based on cellular deconvolution of bulk tumors into two distinct immune cell states: cancer-associated (tumor-infiltrating immune cell-like) and noncancer-associated (peripheral blood mononuclear cell-like). Prediction model performed significantly better when immune cells were deconvoluted into two states rather than a single state, suggesting that the difference in cancer recurrence was better explained by distinct states of immune cells. It indicates the importance of distinguishing immune cell states using cellular deconvolution to improve the prediction of colorectal cancer recurrence.


INTRODUCTION
The prediction of recurrence in colorectal cancer patients is a challenging task. Colorectal cancer is the second leading cause of cancer-related deaths, with approximately 551,000 fatalities globally in 2018 (Bray et al., 2018). In particular, recurrence following surgery is one of the leading causes of patient mortality. After tumor excision, which is the most common treatment for colorectal cancer, 30-50% of patients experience recurrence and show a poor prognosis (Ryuk et al., 2014). For patients at a high risk of recurrence, adjuvant chemotherapy or intensive follow-up is advised to minimize the recurrence rate or detect recurrent tumors early (Desch et al., 2005;Osterman and Glimelius, 2018). Several clinical characteristics of patients are currently utilized to predict the risk of recurrence after surgery, such as tumor stage (Osterman and Glimelius, 2018), metastasis (Ryuk et al., 2014), or MSI status (Walker et al., 2014), although the prediction performance is still poor. The identification of new predictive markers for colorectal cancer recurrence is highly required in the field. fraction of cell types using omics data of bulk tumors. The methods use mathematical equations to calculate the proportion of each cell type in a bulk tumor, assuming that the gene expression of bulk tumors is a weighted sum of expression profiles of various cell types. MethylCIBERSORT , for example, dissects each cell type's contribution to the aggregated methylation signals in bulk tumors based on the methylation reference profiles of various cell types. Since those deconvolution methods heavily rely on the availability of accurate references, the investigation of the cell types for which reference omics data are not offered is limited.
Despite reports that the transcriptome and epigenome of immune cells might be drastically altered after tumor infiltration (Mehdi and Rabbani, 2021), omics data of immune cells from peripheral blood are still employed as reference profiles to examine the prognostic landscape of TIIC using deconvolution methods Craven et al., 2021). The tumor microenvironment can influence the expression and methylation patterns of immune cells, causing them to differentiate into cancer-associated (procancer or anticancer) cell types (Mehdi and Rabbani, 2021). Upon dendritic cell maturation or CD8 + T cell differentiation, DNA sequences containing transcription binding sites and promoters of the genes that control immune cell functions show dramatic alterations in methylation patterns, resulting in gene expression changes (Scharer et al., 2013;Zhang et al., 2014).
In this study, we examined the landscape of distinct immune cell states in bulk tumors and constructed a machine learning (ML) framework to predict colorectal cancer recurrence. To do so, we generated methylome data of various immune cell types from both tumors and peripheral blood and trained a cell deconvolution method to estimate the cellular proportion of TIIC-like and peripheral blood mononuclear cell (PBMC)-like cells in bulk tumors. We postulated that state-altered (TIIC-like, such as procancer or anticancer) and state-maintaining (PBMC-like, such as immature or bystander) immune cells coexisted in bulk tumors, each playing a different role in cancer immunity. We built an ML model to predict the recurrence of colorectal cancer patients based on the inferred proportion of TIIC-and PBMC-like cells in bulk tumors. The predictive performance of the ML model was tested using independent internal and external datasets. The ML model exhibited the best predictive performance when immune cells were deconvoluted into two different states rather than a single state, thus implying that separating the immune cell states is crucial for correctly predicting recurrence. Furthermore, our model outperforms conventional models using clinical data from patients, TNM stage, metastasis, and/or MSI status. We also examined the interpretability of the model by observing that the methylation biomarkers identified by our method were associated with the genes that control immune cell migration or activation.

Study design
We constructed a predictive framework that infers the composition of tumor-associated immune cells (TAICs) and PBMC-like immune cells from bulk methylome data and utilizes them to predict the risk of recurrence of patients with colorectal cancer. To identify epigenetic markers of TAICs, we compared the methylation patterns of tumor-infiltrating immune cells (TIICs) containing a relatively large number of TAICs with those of peripheral blood mononuclear cells (PBMCs) containing a relatively small number of TAICs. Specifically, we isolated epithelial cells, fibroblasts, and four types of TIICs (CD4 + , CD8 + T cells, DCs, and macrophages) in tumors and four types of PBMCs (CD4 + , CD8 + T cells, DCs, and monocytes) in peripheral blood from seven colorectal cancer patients. Then, we obtained the methylation patterns of CpG sites from the isolated cells by targeted bisulfite sequencing (Figure 1). Through a principal component analysis, we observed that the methylation patterns were distinct depending on whether they infiltrated into cancer or not, even for the same type of immune cells ( Figure S1), suggesting that TIICs and PBMCs can be distinguished by epigenetic markers. Using the dataset, we (i) built a cell deconvolution model inferring TIIC-and PBMC-like immune cells in bulk tumors and (ii) developed a machine learningbased model predicting the recurrence of colorectal cancer patients by using the inferred cellular compositions. To validate the prediction performance, we used 114 colorectal cancer patients from the Samsung Medical Center (SMC) cohort, with 46 exhibiting recurrence and 64 exhibiting nonrecurrence based on an average 5-year follow-up after tumor resection.
We built a deconvolution model for the TIIC + PBMC approach and two negative control models using only methylation patterns of either TIICs (TIIC-based approach) or PBMCs (PBMC-based approach) ( Figure S2). To construct deconvolution models, we leveraged MethylCIBERSORT  iScience Article automatically detects signature methylation patterns and created a model to infer the proportions of cell types from bulk tumor data. Using MethylCIBERSORT, we defined three methylation signatures for TIIC + PBMC-, TIIC-, and PBMC-based approaches, each of which consists of 1,616, 423, and 538 CpG sites, respectively (Table S1), and subsequently trained three deconvolution models for the three approaches. Using the trained deconvolution models, we inferred cellular compositions in bulk tumors of 114 colorectal cancer patients, which were later used to predict the recurrence of the patients.
With the inferred cellular compositions, we developed machine learning-based models to predict the recurrence of colorectal cancer patients. In antitumor immunity, interactions between various immune cell types are crucial (Luca et al., 2021;Steen et al., 2021). To train the association of immune cell proportions with cancer recurrence, we used a machine learning technique, the ExtraTree classifier (Geurts et al., 2006). We expect that there will be certain combinations of immune cells that can predict cancer recurrence.
To this end, we tested all combinations of ten cell types in our machine learning pipeline. To obtain reliable accuracy estimation, we randomly split the dataset of 114 patients at a ratio of 7:3 to obtain 70% as the training set and the remaining 30% as the test set. We repeated this random splitting 100 times and obtained 100 accuracy measurements for each cell type combination. Based on the average accuracy measurement, we chose the top performing cell type combinations.
Performance of the TIIC + PBMC approach to predict recurrence of colorectal cancer We compared the predictive performance of the three approaches, TIIC + PBMC, TIIC-based, and TBMCbased, and found that the TIIC + PBMC approach outperformed the others. The area under the curve (AUC) of the top 10 performing cell combinations of the TIIC + PBMC approach ranged from 0.67 to 0.69 (Figure 2A). In contrast, the AUCs of the top 10 cell combinations of the TIIC-and PBMC-based approaches ranged from 0.55 to 0.61 and from 0.50 to 0.57, respectively. The AUC of the top performing cell combination of the TIIC + PBMC approach was significantly higher than those of the TIIC-and PBMC-based approaches (p value = 4.5 3 10 À10 and 8.3 3 10 À18 , respectively). Specifically, when the proportions of TIIC-like CD8 + T cells, DCs, and PBMC-like DCs were trained, the performance to predict recurrence was the best (AUC = 0.69). DCs infiltrate the tumor to take up antigens and activate CD8 + T cells through cross-presenting exogenous antigens to kill cancer cells (Fu and Jiang, 2018). The interaction between the two cell types is crucial for antitumor immunity. DCs and CD8 + T cells are frequently used as good  , selected 1,616 signature CpG sites whose methylation levels are distinct across fibroblasts, epithelial cells, tumor-infiltrating immune cells (TIICs), and peripheral blood mononuclear cells (PBMCs). TIIC and PBMC indicate CD4 + , CD8 + T cells, dendritic cells, and macrophages/monocytes from tumor and peripheral blood. The method trained the methylation levels of significant CpG sites to infer the proportion of cells in patients' bulk tumors. A 5-year recurrence of patients was predicted with the inferred cellular compositions using a machine learning technique, ExtraTree (Geurts et al., 2006), Random Forest (RF) (Breiman, 2001), Extreme Gradient Boosting (XGBoost) classifiers (Chen and Guestrin, 2016). iScience Article prognostic markers in cancer studies (Fridman et al., 2017). As reflected in these reports, TIIC-like DCs and CD8 + T cells are frequently included in the top 10 cell type combinations (10 and 5, respectively).
In the TIIC + PBMC approach, we deconvoluted bulk tumors into TIIC-and PBMC-like cells because we assumed that the states of immune cells in the patients' bulk tumors are heterogeneous, and each has a different effect on cancer recurrence. We confirmed that our assumption was valid in predicting the recurrence of colorectal cancer. We found that the proportion of TIIC-like immune cells significantly better predicts cancer recurrence when considering the methylation pattern of PBMCs together ( Figure S3). Specifically, we observed that the performance using only the proportions of TIIC-like immune cells from the TIIC + PBMC approach was significantly higher than that using the proportions of TIICs from the TIIC-based approach (p value = 3.9 3 10 À7 ). Similarly, the predictive performance using the proportions of PBMC-like immune cells from the TIIC + PBMC approach was higher than those from the PBMC-based approach (p value = 9.3 3 10 À3 ).
We confirmed that our results are consistent when we use other tree-based machine learning techniques, random forest (RF) classifiers (Breiman, 2001) and extreme gradient boosting (XGBoost) (Chen and Guestrin, 2016). When we used the RF classifier, the AUC of the top performing cell combination in the TIIC + PBMC approach was 0.67, which was significantly higher than the AUC of the top performing cell combination in the TIIC-(AUC = 0.64 and p value = 7.3 3 10 À3 ) or PBMC-based methods (AUC = 0.60 and p value = 1.5 3 10 À7 ) ( Figure S4). When we used the XGBoost classifier, the AUC of the top performing cell combination in the TIIC + PBMC approach was 0.67, which was significantly higher than the AUC of the top performing cell combination in the TIIC-(AUC = 0.60 and p value = 4.6 3 10 À7 ) or PBMC-based methods (AUC = 0.56 and p value = 3.8 3 10 À15 ) ( Figure S4). This means that the successful prediction of cancer recurrence using the TIIC + PBMC deconvolution approach is robust regardless of the machine learning techniques.
We validated that the process of cell deconvolution of the patient's bulk tumor had a significant effect on predicting colorectal cancer recurrence. When predicting cancer recurrence by training the methylation levels of the 1,616 signature methylation features from the deconvolution model, the AUC was 0.63, which was significantly lower than the AUC using the cell deconvolution result ( Figure S5, p value = 1.4 3 10 À6 ). Moreover, we also found that the AUC training the methylation levels of 985 signature methylation features related to TIIC-like CD8 + T cells, DCs, and PBMC-like DCs, which were the best cell combination for predicting cancer recurrence, was still significantly lower than the AUC using the cell deconvolution result (p value = 2.1 3 10 À6 ). This suggests that the inferred cellular composition, which is the cell type-level abstraction, is more useful in predicting cancer recurrence than unprocessed loci-level methylation patterns.
We further validated the predictive performance of our model using an independent dataset, The Cancer Genome Atlas (TCGA). It provides recurrence data, clinical data, and methylation data for primary tumors in 98 colon adenocarcinoma patients (TCGA-COAD) and 8 rectal adenocarcinoma patients (TCGA-READ). We found that using the proportions of fibroblasts, epithelial cells, TIIC-like macrophages, PBMC-like CD4 + T cells, and monocytes from the TIIC + PBMC approach, cancer recurrence was predicted with an AUC of 0.69 ( Figure 2B). In contrast, the AUCs of the top performing cell combinations of the TIIC-and PBMC-based approaches were 0.61 and 0.58, respectively, which were significantly lower than the AUCs of the TIIC + PBMC approach (p values are 1.3 3 10 À34 and 1.1 3 10 À34 , respectively). Specifically, we deconvoluted the bulk tumors of TCGA patients using three deconvolution methods and predicted recurrence using a machine learning model trained with the deconvolution results of 114 colorectal cancer patients from the SMC cohort. This result suggests that our predictive strategy is also applicable to other races or cohorts, although the best feature of cell combinations was different across cohorts. Clinical data of patients at the time of surgery, such as metastasis, microsatellite (MSI) status, and TNM stage, are associated with the recurrence of colorectal cancer after surgical excision of the primary tumor (Osterman and Glimelius, 2018;Ryuk et al., 2014;Walker et al., 2014). We found that integrating the clinical data with the cell deconvolution results improved the performance of cancer recurrence prediction ( Figure 2C). Specifically, we evaluated the performance of models that integrate all combinations of clinical data with the deconvolution result of the top performing cell combination. The performance significantly increased when training the TNM stage, metastasis, and MSI status with the deconvolution result of the TIIC + PBMC approach (AUC = 0.74) compared to when only the deconvolution result was used (AUC = 0.69 and p value = 8.2 3 10 À6 ). On the other hand, in the case of the TIIC-and PBMC-based approaches, even when integrated with clinical data, the AUCs were 0.57 and 0.64, respectively, confirming that the performance was significantly lower than that of the TIIC + PBMC approach integrated with clinical data (p values are 1.9 3 10 À24 and 8.5 3 10 À14 , respectively). In the TCGA cohort, we reconfirmed that integrating the TIIC + PBMC approach with MSI status significantly improved performance (AUC = 0.70, p value = 3.0 3 10 À12 ). In contrast, the AUCs of the integrating TIIC-and PBMC-based approaches with MSI status were significantly lower than the AUC of the integrating TIIC + PBMC approach with MSI status ( Figure 2D, p values are 1.3 3 10 À34 and p value = 1.3 3 10 À34 , respectively). For the cross-cohort validation with TCGA cohort, the integration of TNM stage with the deconvolution result showed poor prediction performance ( Figure S6). Notably, we observed that the contribution of the deconvolution result to the -prediction performance is significant in the integrative approach with the clinical data. When only the clinical data were used, performances were significantly lower than using the deconvolution results together in both SMC and TCGA cohorts (Figures 2C and S6, SMC cohort: AUC = 0.61 and p value = 9.2 3 10 À22 ; TCGA cohort: AUC = 0.50 and p value = 2.8 3 10 À39 ).
The class of recurrent and nonrecurrent CRC patients for the SMC cohort was balanced (42%:58%), but that for the TCGA cohort was not (17%:83%). To evaluate the performance insensitive to the class imbalance of the TCGA cohort, we measured an additional metric, Cohen's Kappa, which assess the agreement between two raters. Consistent with our previous results, we observed that the performance of the TIIC + PBMC approach was significantly higher than those of TIIC-and PBMC-based approaches ( Figure S7, p values are 5.5 3 10 À30 and 2.3 3 10 À35 , respectively.) Specifically, the average of Cohen's Kappa of the TIIC + PBMC was 0.19, whereas those of TIIC-and PBMC-based approaches were 0.13 and 0.11, respectively. In addition, the improved performance was consistently observed in the results integrating clinical data ( Figure S8).
Recently, tumor location and the presence of adjuvant therapy has been suggested to have prognostic impact in colorectal cancer (Shida et al., 2020). We investigated the impacts of the two clinicopathological characteristics in predicting cancer recurrence and found that they were not effective in the prediction. There was no significant difference in tumor location and treatment between recurrent and nonrecurrent patients from the SMC cohort ( Figure S9). When it comes to using machine learning, tumor location and treatment showed poor performances to predict cancer recurrence (Figures S10 and S11, AUC ranged from 0.50 to 0.58). In addition, the integration of the two characteristics with the immune cell deconvolution resulted in no improvement of predictive performances. We excluded the features of tumor location and adjuvant therapy from the following analyses.
To interpret the machine learning model to predict the recurrence of patients with colorectal cancer, we compared the distributions of cell type proportions and clinical data used as prediction features for patient groups predicted as nonrecurrence or recurrence. We found that the patient groups predicted as nonrecurrence and recurrence were divided into three and two groups, respectively ( Figure 2E). The first group predicted to have nonrecurrence (NR-1) showed a high frequency of TIIC-like DCs in patients' bulk tumors. Immunogenic DCs play an important role in anticancer immunity, such as presenting tumor antigens and delivering antigens to lymph nodes for adaptive immunity against tumors (Gardner and Ruffell, 2016). This group corresponds to the report that colorectal cancer patients with a large amount of DC infiltration have better overall survival than those who do not (Schwaab et al., 2001). The second group predicted to have nonrecurrence (NR-2) had a large distribution of TIIC-like CD8 + T cells and DCs. DCs activate CD8 + T cells through cross-presenting exogenous antigen and kill tumor cells (Fu and Jiang, 2018). Crosstalk between CD8 + T cells and DCs is crucial for antitumor immunity (Fu and Jiang, 2018). iScience Article neoantigen, various immune cells, including antigen-presenting cells from peripheral blood, rapidly infiltrate into the tumor (Llosa et al., 2015), and as a result, it is associated with a favorable prognosis (Deschoolmeester et al., 2011). In contrast, the first group predicted as recurrence (R-1) had a small population of TIIC-like CD8 + T cells and DCs, showing the opposite distribution of immune cells to the NR-2 group. The second group predicted to have recurrence (R-2) was characterized by metastasis, which is consistent with reports that patients with lymph node metastasis have a high recurrence rate after cancer resection surgery (Asano et al., 2017).

Functional analysis of genes mapped to CpG sites used in cell deconvolution
We observed that the CpG sites used in the TIIC + PBMC approach tended to show significantly different methylation levels between TIICs and PBMCs of the same immune cell types. To deconvolute patients' bulk tumors, the TIIC + PBMC-, TIIC-, and PBMC-based approaches utilized methylation patterns of 1,616, 423, and 538 CpG sites, respectively ( Figure 3A). Of the 1,616 signature CpG sites used in the TIIC + PBMC approach, 636 CpG sites showed significantly different methylation levels between TIICs and PBMCs of the same immune cell types. However, in the case of TIIC-and PBMC-based approaches, only 53 and 72 sites were differentially methylated between TIIC and PBMCs, respectively ( Figure 3B). These two approaches are not suitable to differentiate distinct immune states, TIIC-and PBMC-like immune cells.
We found that the DMCs between TIICs and PBMCs used in the TIIC + PBMC approach tended to be located near genes that regulate the migration of immune cells ( Figure 3C). For functional characterization of TIIC-PBMC, TIIC-TIIC, and PBMC-PBMC DMCs used in the TIIC + PBMC approach, we discovered the regulatory elements (REs) located within 1,250 bp from each CpG site and mapped REs to their target genes. Using a functional enrichment test, we found that the TIIC-PBMC DMCs were significantly enriched in genes that control the process of immune cell migration to tissues, such as 'thymocyte migration For the functional characterization by normalizing the CpG density present in base-level methylation data, functional annotation of signature CpG sites was also performed using GOmeth (Maksimovic et al., 2021;Phipson et al., 2016). We observed a similar tendency ( Figure S12) to the previous results. The TIIC-PBMC DMCs are associated with genes that control immune cell migration, such as 'positive regulation of leukocyte migration (GO:0002687)', and genes that modulate immune cell activation, such as 'leukocyte activation (GO:0045321)'. The TIIC-TIIC DMCs are associated with genes that regulate various immune cell activation, such as 'myeloid leukocyte activation (GO:0002274)'. On the other hand, the PBMC-PBMC DMCs are associated with genes crucial for immune cell homeostasis, such as 'regulation of lymphocyte chemotaxis (GO:1901623)'. These results suggest that without prior knowledge, our deconvolution model captured biologically valid information.
As an example of TIIC-PBMC DMCs, four signature CpG sites of the TIIC + PBMC approach are located around the enhancer regulating the DOCK8 gene ( Figure 3D). These CpG sites are likely to affect the expression of the DOCK8 gene because they are close to the DOCK8 gene and 608 bp, 619 bp, 653 bp, and 655 bp apart. The four sites were significantly hypermethylated in the TIICs of CD4 + , CD8 + T cells, DCs, and macrophages compared to the corresponding PBMCs ( Figure 3E, p values ranged from 2.6 3 10 À2 to 6.5 3 10 À3 ). DOCK8 is a guaninenucleotide exchange factor that activates CDC42, which regulates actin polymerization and cytoskeleton rearrangement to control the migration of T cells (Xu et al., 2017) and DCs (Harada et al., 2012). DOCK8 deficiency results in immunodeficiency and increased cancer risk, thus supporting that the epigenetic marker for DOCK8 is associated with recurrence in colorectal cancer patients. iScience Article

Hypomethylation of the CpG sites near immunogenic DC markers in TIIC DCs
As shown in the previous results in Figure 2, we observed a lower risk of recurrence in colorectal cancer patients with a high proportion of TIIC-like DCs. We also discovered that the TIIC-like DCs is crucial for predicting cancer recurrence. Specifically, we measured feature contribution using pRF, which estimates the significance of feature importance by permutating the response variable. The TIIC-like DCs showed significant feature importance ( Figure S13, p value = 9.9 3 10 À3 ), which means that they are capable of increasing the performance in predicting cancer recurrence.
However, according to recent studies, not all tumor-infiltrating DCs show a favorable prognosis for colorectal cancer (Gardner and Ruffell, 2016). Tumor-infiltrating immunogenic DCs transport cancer-associated antigens to the draining lymph node and induce T cell priming to initiate anticancer immunity (Roberts et al., 2016). Thus, immunogenic DCs in the patient's bulk tumor are associated with a favorable prognosis of patients with colorectal cancer (Schwaab et al., 2001). However, some tumor-infiltrating DCs become suppressive in the presence of cytokines such as IL-6 and M-CSF secreted by cancer cells and inhibit anticancer immunity by inactivating immune cells (Zong et al., 2016).
To examine whether the TIIC DCs used in our study were immunogenic or suppressive, we compared the methylation levels of the CpG sites near REs regulating the gene markers of immunogenic DCs between TIIC and PBMC DCs. It is based on a report that DC development and maturation are associated with a great loss of DNA methylation of RE regulating immunogenic DC gene markers, such as IL10 and CCR7 (Zhang et al., 2014). We found that the sites near REs that control immunogenic DC markers tended to be significantly hypomethylated in TIIC DCs compared to PBMC DCs ( Figure 4A). Specifically, we investigated the DMCs between TIIC and PBMC DCs from colon cancer patients and checked the distribution of DMCs on the sites near REs associated with seven immunogenic DC markers. The DMCs near the REs regulating five immunogenic DC gene markers (HLA-DRA, CCR7, CD40, CCL22, IFNG) were significantly hypomethylated in TIIC DCs (p values ranged from 3.9 3 10 À3 to 4.3 3 10 À24 ). These results showing that TIIC DCs are immunogenic support the validity of our model because patients with a high proportion of TIIC-like DCs are anticipated to have a low recurrence rate.
As an example of immunogenic DC markers, HLA-DRA constitutes the MHC class II (MHC-II) complex, which plays a key role in antitumor immunity by displaying neoantigen peptides and activating T cells. MHC class II molecules are frequently used as markers of immunogenic DCs (Hayashi et al., 2020). We found 30 DMCs within 2,500 bp from enhancers that regulate HLA-DRA, and all of them were significantly hypomethylated in TIIC DCs compared to PBMC DCs ( Figure 4B). The 30 DMCs were distributed in the regions near five enhancers regulating HLA-DRA, and their methylation levels were 32% less on average in TIIC DCs than in PBMC DCs. These results indicated that the TIIC DCs used in this study show the properties of immunogenic DCs.

DISCUSSION
In this study, we predicted the recurrence of colorectal cancer using the deconvolution of bulk tumors into distinct states of immune cell types based on the methylation profiles of TIICs and PBMCs. Our results indicated that recurrence predictions were improved when using methylation profiles of TIICs and PBMCs together compared with the profiles of TIICs or PBMCs alone (Figures 2A, 2B, and S3). Our approach is based on the notion that the cellular phenotypes and functions of TIICs are heterogeneous, and they differentially affect cancer recurrence. Consistent with our results, it has been found that the prognostic capabilities of immune cell subtypes found in tumors are different. For example, in non-small-cell lung cancer (NSCLC), there were distinct amounts of subtypes of tumor-infiltrated T cells, such as naive, memory, and effector T cells (Sheng et al., 2017). Moreover, the production levels of TNF-a, an important cytokine in anticancer immunity, varied among individual subtypes of tumor-infiltrated T cells (Sheng et al., 2017). In addition to T cells, tumor-infiltrated monocytes also play multifaceted roles in anticancer immunity. They mediate both protumoral and antitumoral responses along with the tumor microenvironment (Guilliams et al., 2018;Jeong et al., 2019).
There have been efforts to predict the clinical outcomes of cancer patients by using immune cell compositions, since infiltration of immune cells into tumors is one of the important factors for clinical outcomes such as survival (Barnes and Amir, 2017), metastasis (Gonzalez et al., 2018), and drug response (Waldman et al., 2020). Therefore, many studies utilize expression or methylation profiles derived from PBMCs to infer ll OPEN ACCESS iScience 25, 105392, November 18, 2022 9 iScience Article immune cell compositions for the prediction of patients' clinical outcomes Craven et al., 2021). However, upon infiltration, the expression and methylation profiles of immune cells were significantly changed, since the tumor microenvironment affected the cells (Figures 3B, 3D, 3E, and 4). For example, the expression profiles of tumor infiltrating regulatory T cells (Tregs) were distinct from those of Tregs in PBMCs (DeSimone et al., 2016). Intriguingly, upregulated genes in tumor-infiltrating Tregs were significantly enriched for the activation of immune cells (DeSimone et al., 2016). Consistent with this finding, the methylation levels of genes related to immunogenic DCs were hypomethylated in TIIC DCs compared to PBMC DCs (Figure 4). These results suggest that methylation profiles from TIICs provide additional information about the role of immune cells in the tumor microenvironment, distinct from those in peripheral blood, thereby enabling improvement of recurrence prediction.
Our results indicated that abstraction of methylation information to depict immune cell compositions based on the deconvolution method improved the prediction of recurrence in colorectal cancer patients (Figures 2 and S6). The performance of predicting cancer recurrence was reduced when the methylation profiles derived from patients were directly used without immune cell deconvolution ( Figure S5). This result suggests that the transformation of the methylation profile from locus information into cellular information using a deconvolution method enhances the predictive performance of cancer recurrence. The advantage of the information transformation before constructing a predictive model can also be found in an example of autonomous driving. Compared to directly using pixel data of camera images, known as end-to-end learning, the performance of driving path planning was improved when camera images were processed with nearby vehicles or pedestrians before planning (Ng, 2019). Camera images themselves contain too iScience Article many features with unmeaningful pixels to train a model for planning a path. However, the information transformation of image pixels into objects enables a model to be trained easier and more precisely by specifying meaningful features. Likewise, because the deconvolution method transforms complicated information into simple and meaningful information, it has successfully been utilized in many cancer studies, such as for finding prognostic markers  and predicting the response to chemotherapy (Ali et al., 2016) or immunotherapy (Charoentong et al., 2017).
We envision that our work here offers new perspectives for predicting colorectal cancer recurrence after surgery. First, we developed a standard model ensuring the reproducibility of predicting colorectal cancer recurrences. Briefly, we constructed a machine learning model based on cellular deconvolution of bulk tumors into two distinct immune cell states to predict cancer recurrence. The model makes robust performances in predicting cancer recurrence across independent cohorts with patients of different races (Figure 2, SMC cohort: 100% Asian, TCGA cohort: 79% White, 18% Black or African American, and 3% Asian). It suggests that recurrence prediction based on cell deconvolution across immune cell states can be applied to diverse cohorts. Second, the platform we suggested in this study makes a recurrence prediction without additional biopsy using residual tumor samples after surgery. Tumor resection is one of the most preferred treatments for colorectal cancer patients. Our platform only needs to profile the methylation patterns of a few CpG sites in the bulk tumor obtained after tumor resection. We provide a list of CpG sites to be profiled and a machine learning model for recurrence prediction through an open resource, expecting the platform facile to be employed in clinical practice. Third, deconvolution signature was synergistic to histologies which have been previously reported to have a prognostic impact on recurrence prediction. We observed that integrating the immune cell deconvolution with the clinical data improved the performance of cancer recurrence prediction ( Figures 2C and 2D). The performance significantly increased when combining the immune cell deconvolution with histology currently employed in clinical practice to predict recurrence (AUC = 0.74) compared to only the histology used (AUC = 0.61 and p value = 9.2 3 10 À22 ). This result suggests the applicability of cell deconvolution in clinical practice.

Limitations of the study
A potential limitation of using immune cell states for the cancer recurrence prediction is a limited performance because of the use of a single omics layer. To improve the prediction performance, multi-omics information is required. We observed that the incorporation of clinical information such as TNM stage and metastasis with immune cell composition improved the predictive performance of recurrence in colorectal cancer patients ( Figures 2C and 2D). It suggests that integration of data obtained from other layers could enhance the predictive performance of recurrence in colorectal cancer patients because recurrence of colorectal cancer is caused by heterogeneous molecular mechanisms (Augestad et al., 2017). Besides this study, other multilayered information including transcriptome (Tian et al., 2017) or genomic mutations (Hutchins et al., 2011) are used to predict the recurrence of colorectal cancer. We expect that our approach, when combined with additional information, could improve the predictive ability of recurrence in colorectal cancer patients.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:  iScience Article iScience Article Any additional information required to reanalyze the data reported in this paper is available from the lead contacton request.

Patient and sample collection
This study was approved by the institutional review boards of the Samsung Medical Center (approval no. SMC 2018-04-074-004). Written informed consent was obtained from all subjects. All experimental methods complied with the Helsinki Declaration. Seven colorectal cancer patients were recruited to identify signature CpG sites of TIICs and PBMCs. As a retrospective cohort to predict cancer recurrence within 5 years, 114 colorectal cancer patients were recruited. The retrospective cohort with 114 patients was designated as SMC cohort in this study. Detailed information including age, sex, and clinical records of patients in SMC cohort at the time of surgery was provided in Table S2.

Data curation of colorectal cancer patients (The cancer genome Atlas (TCGA) cohort)
We downloaded DNA methylation profiles of bulk primary tumors in colorectal adenocarcinoma (COAD) and rectal adenocarcinoma (READ) patients from the GDC data portal repository. We only used tumor samples from patients whose methylation profiles were measured by Illumina Human Methylation 450 (Infiniumâ HumanMethylation450 BeadChip). To avoid sample redundancy, meaning two or more samples from one patient, we used a sample in which methylation levels of more CpG sites were measured. Beta values were utilized as methylation levels of individual CpG sites.

Machine learning-based prediction of cancer recurrence using cell proportions inferred from MethylCIBERSORT
We used extremely randomized trees (Geurts et al., 2006), extreme gradient boosting (Chen and Guestrin, 2016), and random forest (Breiman, 2001), which are decision tree-based ensemble machine learning models, for the prediction of recurrence in colorectal cancer patients. To predict patient recurrence, the cellular proportions of bulk tumors from individual patients were used as an input for machine learning models. iScience Article learning models. Ascending colon, hepatic flexure colon, transverse colon, and cecum were assigned as right-sided. Rectum, sigmoid colon, splenic flexure colon, st, rectosigmoid junction rectum, and descending colon were assigned as left-sided. To validate the performance in predicting the recurrence of the SMC cohort, we used Monte Carlo cross-validation. We randomly selected 70% of the SMC cohort to train the models, and the prediction performances of the models were investigated using the remaining 30% of patients. The model constructions and performance measurements were performed 100 times iteratively (100 times Monte Carlo cross-validation).
To validate the prediction performance of the model for the TCGA cohort, we trained the model using the immune cell proportions (or incorporating clinical information) of the SMC cohort and predicted the recurrence of TCGA patients. The model constructions and performance measurements were performed 100 times iteratively. The performances of machine learning models were measured by the area under the receiver operating characteristic curve (AUC). For TCGA cohort, Cohen's kappa was additionally used to measure performance of machine learning models because of the imbalance between recurrence and nonrecurrence. The prediction procedures were implemented in R by using the extraTrees package for extremely randomized trees, xgboost package for extreme gradient boosting, randomForest package for random forest, and ROCR package for AUC measurement, psych package for Cohen's kappa measurement.

Functional characterization of the signature CpG sites used in the deconvolution model
For functional characterization of the signature CpG sites used in the deconvolution models, we mapped each CpG site into a gene whose expression is likely affected by the methylation of the CpG site. To do so, we discovered regulatory elements (REs) upstream and downstream of 1,250 bp from each CpG site based on the report that methylation levels near REs can affect the expression of their target genes (Xiong et al., 2018). We downloaded promoter-gene and enhancer-gene mappings annotated in Marbach et al. (2016) from http://regulatorycircuits.org/. Marbach et al. provided the regions of REs and their target genes defined by cap analysis of gene expression (CAGE) data from the FANTOM5 project (Forrest et al., 2014) for approximately 1,000 human tissue and cell lines, which enabled the mapping of transcription start sites with high sensitivity. Then, we analyzed GO term enrichment for the genes mapped by the CpG sites. We used immune response-related GO terms obtained from all child terms of 'immune response (GO:0006955)'. Enrichment tests were performed to calculate an adjusted p value using the hypergeometric test and Benjamini-Hochberg correction procedure.
For the functional characterization by normalizing the CpG density, we calculated GO term enrichment of DMCs using GOmeth function from the missMethyl R package (Maksimovic et al., 2021;Phipson et al., 2016). TIIC-PBMC, TIIC-TIIC, and PBMC-PBMC DMCs were used as input of GOmeth function, and all CpG sites in the panel were used as background. We used the option, plot.bias=TRUE, and default settings were used for any other options.

Measurement of feature importance using pRF
To evaluate the feature importance of each cell type proportion for patient recurrence prediction, we adopted pRF (https://cran.r-project.org/web/packages/pRF), which estimated statistical significance of feature importance by permuting the response variable. The cellular proportions deconvoluted by TIIC + PBMC were used as an input of pRF, and hyperparameters, n.perms = 100, mtry = 10, and type = ''classification'' were used.

Comparison of the methylation levels of CpG sites near immunogenic DC markers between TIIC and PBMC DCs
To compare the methylation levels of TIIC and PBMC DCs, we used dispersion shrinkage for sequencing (DSS) (Feng et al., 2014), which is a statistical method to detect differentially methylated regions based on a beta-binomial regression model. Differentially methylated CpG sites (DMCs) between TIIC and PBMC DCs were identified using DSS v.2.38.0 (smoothing = TRUE and p value < 0.001). We analyzed the distributions of DMCs located 1,250 bp upstream and downstream from the REs regulating immunogenic DC marker genes. We used seven well-established DC activation/maturation marker genes, HLA-DRA, CCR7, CD40, CCL22, IFNG, IL12, and CD86. For IL12, we used REs of IL12A and IL12B, since IL12 is a complex of them. We compared the methylation levels of DMCs obtained by DSS between TIIC and ll OPEN ACCESS